Proximal Policy Optimization 2 (PPO2) — from scratch in PyTorch#
This notebook builds a low-level PPO2 implementation in PyTorch and uses it to train an agent on a classic control environment.
Learning goals#
By the end you should be able to:
derive the PPO2 clipped objective and connect it to a trust-region intuition
implement PPO2 (rollout → GAE → multi-epoch mini-batch updates) in raw PyTorch
understand exactly how PPO2 differs from PPO1 (both in the paper and in Stable-Baselines naming)
plot episodic rewards and training diagnostics with Plotly
Prerequisites#
comfortable with gradients and backprop
basic RL notation: policy \(\pi_\theta(a\mid s)\), returns, value function \(V_\phi(s)\)
packages:
`torch`, `gymnasium` (or `gym`), `numpy`, `plotly`
import math
import time
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple
import numpy as np
import plotly.express as px
import plotly.graph_objects as go
import os
import plotly.io as pio
from plotly.subplots import make_subplots
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.distributions import Categorical, Independent, Normal
# Gymnasium first (new API), fallback to Gym (old API)
try:
import gymnasium as gym
except Exception: # pragma: no cover
import gym # type: ignore
pio.templates.default = 'plotly_white'
pio.renderers.default = os.environ.get("PLOTLY_RENDERER", "notebook")
np.set_printoptions(precision=4, suppress=True)
SEED = 42
rng = np.random.default_rng(SEED)
torch.manual_seed(SEED)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(SEED)
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
device
device(type='cpu')
1) The RL objective (notation)#
We’ll use the standard episodic discounted-return objective:

\[ J(\theta) = \mathbb{E}_{\tau \sim \pi_\theta}\Big[\sum_{t \ge 0} \gamma^t r_t\Big] \]

where:
\(\tau = (s_0, a_0, r_0, s_1, \dots)\) is a trajectory sampled by following the policy.
\(\gamma \in (0, 1]\) is the discount factor.
Two key helper objects:
Value function: \(V_\phi(s) \approx \mathbb{E}[\sum_{k\ge 0} \gamma^k r_{t+k} \mid s_t=s]\)
Advantage: \(A_t = Q(s_t, a_t) - V(s_t)\) — “how much better was this action than average?”
2) Policy gradients in one equation#
The policy-gradient theorem motivates the surrogate objective:

\[ L^{\mathrm{PG}}(\theta) = \hat{\mathbb{E}}_t\big[\log \pi_\theta(a_t \mid s_t)\,\hat{A}_t\big] \]
In practice we:
sample data with an old policy \(\pi_{\theta_{\text{old}}}\)
estimate advantages \(\hat{A}_t\) (often via GAE)
update the policy using mini-batch SGD.
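The three steps above can be sketched in a few lines of PyTorch. This is a minimal toy example (random logits and made-up advantage estimates, not the notebook's environment) showing how the surrogate is minimized as a negated loss:

```python
import torch
from torch.distributions import Categorical

# Toy tensors standing in for a sampled batch: 4 states, 2 discrete actions.
torch.manual_seed(0)
logits = torch.randn(4, 2, requires_grad=True)
dist = Categorical(logits=logits)
actions = dist.sample()
advantages = torch.tensor([1.0, -0.5, 2.0, 0.3])  # pretend GAE estimates

# Maximizing E[log pi(a|s) * A] == minimizing its negation.
pg_loss = -(dist.log_prob(actions) * advantages).mean()
pg_loss.backward()  # gradients flow back into `logits`
```

PPO keeps this structure but swaps `log_prob * advantage` for the clipped ratio surrogate derived below.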
3) Why PPO exists: “big steps” break policy gradients#
A vanilla policy-gradient update can change the policy too much.
PPO controls this by comparing the new policy to the old policy using the probability ratio:

\[ r_t(\theta) = \frac{\pi_\theta(a_t \mid s_t)}{\pi_{\theta_{\text{old}}}(a_t \mid s_t)} \]
If \(r_t(\theta)=1\) the new policy agrees with the old policy on that sampled action.
The classic importance-sampled surrogate (CPI) is:

\[ L^{\mathrm{CPI}}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big] \]
The problem: maximizing this can push \(r_t\) to extreme values — effectively taking a too-large policy update.
4) PPO1 vs PPO2 (be precise about naming)#
People use “PPO1” vs “PPO2” in two different ways:
A) In the PPO paper (algorithmic variants)#
PPO-Penalty: adds a KL penalty \(\beta\,\mathrm{KL}(\pi_{\text{old}}\,\|\,\pi_\theta)\) and adapts \(\beta\).
PPO-Clip: uses a clipped surrogate objective (no explicit KL penalty term).
A common PPO-Penalty surrogate is:

\[ L^{\mathrm{KLPEN}}(\theta) = \hat{\mathbb{E}}_t\big[r_t(\theta)\,\hat{A}_t\big] - \beta\,\mathrm{KL}\big(\pi_{\theta_{\text{old}}}\,\|\,\pi_\theta\big) \]
with \(\beta\) tuned (often adaptively) to keep the KL near a target. PPO-Clip instead bakes the “keep it close” constraint into the objective via clipping.
Many blogs call these “PPO1” (penalty) and “PPO2” (clip). When this notebook says PPO2, it means PPO-Clip.
B) In OpenAI Baselines / Stable-Baselines (implementation families)#
Stable-Baselines historically exposes two codebases:
PPO1: an older MPI-oriented implementation (requires `mpi4py`), with different batching and optimizer plumbing.
PPO2: a newer implementation that supports vectorized envs and (optionally) value-function clipping (`cliprange_vf`).
Important nuance: Stable-Baselines PPO1 also uses the clipped surrogate; the “1 vs 2” there is mostly engineering, not the core objective.
Concretely, in Stable-Baselines:
PPO1 is documented as an “MPI version”, with hyperparameters like `timesteps_per_actorbatch`, `optim_stepsize`, `optim_batchsize`, and a learning-rate `schedule`.
PPO2 is documented as a “GPU version”, with hyperparameters like `n_steps` (per env), `nminibatches`, `noptepochs`, and the extra `cliprange_vf` option for value clipping.
If you’re comparing results across implementations, these differences (batch construction + optimizer details + value clipping) can matter even when the high-level PPO objective looks similar.
5) PPO2 clipped objective (the main idea)#
PPO2 replaces the CPI surrogate with the clipped surrogate:

\[ L^{\mathrm{CLIP}}(\theta) = \hat{\mathbb{E}}_t\Big[\min\big(r_t(\theta)\,\hat{A}_t,\; \mathrm{clip}(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon)\,\hat{A}_t\big)\Big] \]
Interpretation:
If \(\hat{A}_t > 0\) (action better than baseline), we don’t want \(r_t\) to grow far above \(1+\epsilon\).
If \(\hat{A}_t < 0\) (action worse than baseline), we don’t want \(r_t\) to shrink far below \(1-\epsilon\).
So PPO2 constrains the effective improvement you can get from any single sample.
Full loss (actor + critic + entropy)#
In practice we minimize the negative surrogate plus a value loss and an entropy bonus:

\[ L(\theta, \phi) = -L^{\mathrm{CLIP}}(\theta) + c_v\,\hat{\mathbb{E}}_t\big[(V_\phi(s_t) - \hat{R}_t)^2\big] - c_e\,\hat{\mathbb{E}}_t\big[\mathcal{H}[\pi_\theta(\cdot \mid s_t)]\big] \]

where \(\hat{R}_t\) are “return targets” (often \(\hat{A}_t + V(s_t)\)).
Value function clipping (SB/OpenAI variant)#
Stable-Baselines PPO2 optionally clips value updates (not in the original PPO paper):

\[ V^{\mathrm{clip}}_t = V_{\text{old}}(s_t) + \mathrm{clip}\big(V_\phi(s_t) - V_{\text{old}}(s_t),\, -\epsilon_v,\, +\epsilon_v\big) \]

and uses the max of the unclipped/clipped squared errors.
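A quick numeric sketch of this rule (toy numbers, not from a real rollout): the clip keeps the new value prediction within \(\epsilon_v\) of the old one, and taking the max of the two squared errors makes the clipped term a pessimistic bound:

```python
import numpy as np

v_old = np.array([1.0, 1.0, 1.0])
v_new = np.array([1.05, 1.5, 0.2])   # proposed new predictions
returns = np.array([1.2, 1.2, 1.2])
eps_v = 0.2

# Clip the *change* in V to [-eps_v, +eps_v] around the old prediction.
v_clipped = v_old + np.clip(v_new - v_old, -eps_v, eps_v)   # [1.05, 1.2, 0.8]

loss_unclipped = (v_new - returns) ** 2
loss_clipped = (v_clipped - returns) ** 2
value_loss = 0.5 * np.mean(np.maximum(loss_unclipped, loss_clipped))
```

The second entry moves only from 1.0 to 1.2 (not 1.5), and the third is held at 0.8 (not 0.2), so a single mini-batch step cannot drag the critic arbitrarily far.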
# Visual intuition: how clipping changes the surrogate
eps = 0.2
ratios = np.linspace(0.0, 2.0, 600)
A_pos = 1.0
A_neg = -1.0
def clipped_surrogate(r, A, eps):
r_clipped = np.clip(r, 1.0 - eps, 1.0 + eps)
return np.minimum(r * A, r_clipped * A)
fig = make_subplots(
rows=1,
cols=2,
subplot_titles=(
'Surrogate term when $A_t > 0$',
'Surrogate term when $A_t < 0$',
),
)
for col, A in [(1, A_pos), (2, A_neg)]:
fig.add_trace(
go.Scatter(x=ratios, y=ratios * A, name='CPI: $rA$', line=dict(width=2)),
row=1,
col=col,
)
fig.add_trace(
go.Scatter(
x=ratios,
y=clipped_surrogate(ratios, A, eps),
name=r'PPO2: $\min(rA, \mathrm{clip}(r)A)$',
line=dict(width=3),
),
row=1,
col=col,
)
fig.add_vline(x=1.0 - eps, line=dict(color='gray', dash='dot'), row=1, col=col)
fig.add_vline(x=1.0 + eps, line=dict(color='gray', dash='dot'), row=1, col=col)
fig.update_layout(
title='PPO2 clipping limits how much any sample can improve the objective',
xaxis_title=r'$r_t(\theta)$',
height=380,
legend=dict(orientation='h', yanchor='bottom', y=-0.25, xanchor='left', x=0.0),
)
fig.update_xaxes(range=[0.0, 2.0])
fig.show()
6) Advantage estimation: GAE(\(\lambda\))#
A practical choice is Generalized Advantage Estimation:

\[ \hat{A}_t = \sum_{k \ge 0} (\gamma\lambda)^k\, \delta_{t+k}, \qquad \delta_t = r_t + \gamma V(s_{t+1}) - V(s_t) \]
\(\lambda \to 0\) → low variance, higher bias (more like TD)
\(\lambda \to 1\) → lower bias, higher variance (more like Monte Carlo)
We’ll also use \(\hat{R}_t = \hat{A}_t + V(s_t)\) as the target return for the critic.
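The two limits are easy to verify numerically. Here is a self-contained sketch of the GAE recursion on a toy 3-step rollout (made-up rewards and values, no terminals), checking that \(\lambda = 0\) recovers one-step TD residuals and \(\lambda = 1\) recovers Monte Carlo returns minus the baseline:

```python
import numpy as np

rewards = np.array([1.0, 1.0, 1.0])
values = np.array([0.5, 0.4, 0.3])
next_value = 0.2   # bootstrap value for the state after the rollout
gamma = 0.99

def gae(lam):
    # Backward recursion: A_t = delta_t + gamma * lam * A_{t+1}
    adv, last = np.zeros(3), 0.0
    for t in reversed(range(3)):
        v_next = next_value if t == 2 else values[t + 1]
        delta = rewards[t] + gamma * v_next - values[t]
        last = delta + gamma * lam * last
        adv[t] = last
    return adv

# lambda = 0: advantages are exactly the one-step TD residuals
td = rewards + gamma * np.append(values[1:], next_value) - values
assert np.allclose(gae(0.0), td)

# lambda = 1: advantages are discounted returns minus V(s_t)
mc = np.array([
    rewards[0] + gamma * rewards[1] + gamma**2 * rewards[2] + gamma**3 * next_value,
    rewards[1] + gamma * rewards[2] + gamma**2 * next_value,
    rewards[2] + gamma * next_value,
])
assert np.allclose(gae(1.0), mc - values)
```

The `compute_gae` function implemented below follows the same recursion, with an extra `(1 - done)` mask so terminal steps stop the bootstrapping.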
7) Implementation roadmap (what we’ll code)#
PPO2 training loop per update:
Collect a rollout of length \(T\) (here: `n_steps`) with the current policy.
Compute values \(V(s_t)\), log-probs \(\log\pi(a_t\mid s_t)\), and rewards.
Compute GAE advantages \(\hat{A}_t\) and returns \(\hat{R}_t\).
For `n_epochs` epochs:
shuffle the rollout into mini-batches
optimize the clipped policy objective + value loss + entropy bonus.
We’ll log:
episodic returns (what you care about)
policy loss, value loss, entropy
approximate KL and clip fraction (sanity checks)
def env_reset(env, *, seed: Optional[int] = None):
out = env.reset(seed=seed) if seed is not None else env.reset()
if isinstance(out, tuple) and len(out) == 2:
obs, _info = out
return obs
return out
def env_step(env, action):
out = env.step(action)
# Gymnasium: (obs, reward, terminated, truncated, info)
if isinstance(out, tuple) and len(out) == 5:
obs, reward, terminated, truncated, info = out
done = bool(terminated) or bool(truncated)
return obs, float(reward), done, info
# Gym: (obs, reward, done, info)
obs, reward, done, info = out
return obs, float(reward), bool(done), info
def set_seed_everywhere(seed: int):
np.random.seed(seed)
torch.manual_seed(seed)
if torch.cuda.is_available():
torch.cuda.manual_seed_all(seed)
def explained_variance(y_pred: np.ndarray, y_true: np.ndarray) -> float:
"""1 - Var[y_true - y_pred] / Var[y_true]."""
y_true = np.asarray(y_true)
y_pred = np.asarray(y_pred)
var_y = np.var(y_true)
if var_y < 1e-12:
return float('nan')
return float(1.0 - np.var(y_true - y_pred) / var_y)
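As a quick intuition check on toy numbers: a critic that predicts returns perfectly has explained variance 1.0, while one that always predicts the mean return gets 0.0 (and a critic worse than the mean goes negative):

```python
import numpy as np

y_true = np.array([1.0, 2.0, 3.0, 4.0])

# Perfect predictions: residual variance is zero.
ev_perfect = 1.0 - np.var(y_true - y_true) / np.var(y_true)            # -> 1.0

# Constant-mean predictions: residual variance equals the target variance.
ev_mean = 1.0 - np.var(y_true - y_true.mean()) / np.var(y_true)        # -> 0.0
```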
class ActorCritic(nn.Module):
def __init__(self, obs_dim: int, action_space, hidden_sizes=(64, 64)):
super().__init__()
self.obs_dim = int(obs_dim)
self.action_space = action_space
layers: List[nn.Module] = []
in_dim = self.obs_dim
for h in hidden_sizes:
layers.append(nn.Linear(in_dim, h))
layers.append(nn.Tanh())
in_dim = h
self.backbone = nn.Sequential(*layers)
# Discrete actions: categorical over logits
if isinstance(action_space, gym.spaces.Discrete):
self.is_discrete = True
self.n_actions = int(action_space.n)
self.actor = nn.Linear(in_dim, self.n_actions)
self.log_std = None
# Continuous actions: diagonal Gaussian
elif isinstance(action_space, gym.spaces.Box):
self.is_discrete = False
self.action_dim = int(np.prod(action_space.shape))
self.actor_mean = nn.Linear(in_dim, self.action_dim)
self.log_std = nn.Parameter(torch.zeros(self.action_dim))
else:
raise TypeError(f'Unsupported action space: {type(action_space)}')
self.critic = nn.Linear(in_dim, 1)
def _dist(self, obs: torch.Tensor):
h = self.backbone(obs)
if self.is_discrete:
logits = self.actor(h)
return Categorical(logits=logits)
mean = self.actor_mean(h)
std = torch.exp(self.log_std).expand_as(mean)
return Independent(Normal(mean, std), 1)
def value(self, obs: torch.Tensor) -> torch.Tensor:
h = self.backbone(obs)
return self.critic(h).squeeze(-1)
def act(self, obs: torch.Tensor, action: Optional[torch.Tensor] = None):
dist = self._dist(obs)
if action is None:
action = dist.sample()
log_prob = dist.log_prob(action)
entropy = dist.entropy()
value = self.value(obs)
return action, log_prob, entropy, value
@dataclass
class Rollout:
obs: np.ndarray
actions: np.ndarray
log_probs: np.ndarray
values: np.ndarray
rewards: np.ndarray
dones: np.ndarray
def make_rollout_storage(n_steps: int, obs_dim: int, action_space) -> Rollout:
obs = np.zeros((n_steps, obs_dim), dtype=np.float32)
rewards = np.zeros((n_steps,), dtype=np.float32)
dones = np.zeros((n_steps,), dtype=np.float32)
values = np.zeros((n_steps,), dtype=np.float32)
log_probs = np.zeros((n_steps,), dtype=np.float32)
if isinstance(action_space, gym.spaces.Discrete):
actions = np.zeros((n_steps,), dtype=np.int64)
elif isinstance(action_space, gym.spaces.Box):
act_dim = int(np.prod(action_space.shape))
actions = np.zeros((n_steps, act_dim), dtype=np.float32)
else:
raise TypeError(f'Unsupported action space: {type(action_space)}')
return Rollout(obs=obs, actions=actions, log_probs=log_probs, values=values, rewards=rewards, dones=dones)
def compute_gae(
rewards: np.ndarray,
dones: np.ndarray,
values: np.ndarray,
next_value: float,
*,
gamma: float,
gae_lambda: float,
) -> Tuple[np.ndarray, np.ndarray]:
"""Returns (advantages, returns)."""
n_steps = len(rewards)
advantages = np.zeros((n_steps,), dtype=np.float32)
last_gae = 0.0
for t in reversed(range(n_steps)):
next_nonterminal = 1.0 - dones[t]
next_v = next_value if t == n_steps - 1 else values[t + 1]
delta = rewards[t] + gamma * next_v * next_nonterminal - values[t]
last_gae = delta + gamma * gae_lambda * next_nonterminal * last_gae
advantages[t] = last_gae
returns = advantages + values
return advantages, returns
8) PPO2 update step (PyTorch)#
The heart of PPO2 is computing:
the ratio \(r_t(\theta)\) using old and new log-probs
the clipped surrogate
the value loss (optionally clipped)
the entropy bonus
and then doing standard backprop + optimizer step.
def ppo2_update(
model: ActorCritic,
optimizer: torch.optim.Optimizer,
*,
obs: torch.Tensor,
actions: torch.Tensor,
old_log_probs: torch.Tensor,
old_values: torch.Tensor,
advantages: torch.Tensor,
returns: torch.Tensor,
clip_coef: float,
vf_clip_coef: Optional[float],
ent_coef: float,
vf_coef: float,
max_grad_norm: float,
) -> Dict[str, float]:
action, log_prob, entropy, value = model.act(obs, action=actions)
log_ratio = log_prob - old_log_probs
ratio = torch.exp(log_ratio)
# Policy loss (clipped)
unclipped = ratio * advantages
clipped = torch.clamp(ratio, 1.0 - clip_coef, 1.0 + clip_coef) * advantages
policy_loss = -torch.mean(torch.min(unclipped, clipped))
# Value loss (optionally clipped, SB/OpenAI variant)
if vf_clip_coef is None:
value_loss = 0.5 * F.mse_loss(value, returns)
elif vf_clip_coef < 0:
# match original PPO paper: no value clipping
value_loss = 0.5 * F.mse_loss(value, returns)
else:
v_clipped = old_values + torch.clamp(value - old_values, -vf_clip_coef, vf_clip_coef)
v_loss1 = (value - returns).pow(2)
v_loss2 = (v_clipped - returns).pow(2)
value_loss = 0.5 * torch.mean(torch.max(v_loss1, v_loss2))
entropy_loss = -torch.mean(entropy)
loss = policy_loss + vf_coef * value_loss + ent_coef * entropy_loss
optimizer.zero_grad(set_to_none=True)
loss.backward()
nn.utils.clip_grad_norm_(model.parameters(), max_grad_norm)
optimizer.step()
approx_kl = torch.mean(-log_ratio).item()
clipfrac = torch.mean((torch.abs(ratio - 1.0) > clip_coef).float()).item()
return {
'loss': float(loss.item()),
'policy_loss': float(policy_loss.item()),
'value_loss': float(value_loss.item()),
'entropy': float(torch.mean(entropy).item()),
'approx_kl': float(approx_kl),
'clipfrac': float(clipfrac),
}
9) Train PPO2 on CartPole-v1#
We’ll keep this as close as possible to the textbook PPO2 recipe:
rollout length: `n_steps`
multi-epoch mini-batch SGD updates
GAE(\(\lambda\)) advantages (normalized)
plot episodic rewards
Tip: CartPole is fast. If you try harder environments, prefer vectorized envs (parallel rollouts) for more stable gradient estimates.
def train_ppo2(
*,
env_id: str = 'CartPole-v1',
total_timesteps: int = 150_000,
n_steps: int = 2048,
n_epochs: int = 10,
minibatch_size: int = 64,
gamma: float = 0.99,
gae_lambda: float = 0.95,
learning_rate: float = 3e-4,
clip_coef: float = 0.2,
vf_clip_coef: Optional[float] = None,
ent_coef: float = 0.0,
vf_coef: float = 0.5,
max_grad_norm: float = 0.5,
target_kl: Optional[float] = 0.03,
seed: int = 42,
) -> Dict[str, List[float]]:
set_seed_everywhere(seed)
env = gym.make(env_id)
obs0 = env_reset(env, seed=seed)
obs_dim = int(np.prod(env.observation_space.shape))
model = ActorCritic(obs_dim=obs_dim, action_space=env.action_space).to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=learning_rate, eps=1e-5)
logs: Dict[str, List[float]] = {
'timesteps': [],
'episode_returns': [],
'policy_loss': [],
'value_loss': [],
'entropy': [],
'approx_kl': [],
'clipfrac': [],
'explained_variance': [],
}
obs = obs0
ep_return = 0.0
num_updates = math.ceil(total_timesteps / n_steps)
global_step = 0
for update in range(num_updates):
# Linear schedules (common PPO2 choice)
frac = 1.0 - (update / num_updates)
lr_now = learning_rate * frac
clip_now = clip_coef * frac
for pg in optimizer.param_groups:
pg['lr'] = lr_now
rollout = make_rollout_storage(n_steps=n_steps, obs_dim=obs_dim, action_space=env.action_space)
# Collect on-policy data
for t in range(n_steps):
rollout.obs[t] = np.asarray(obs, dtype=np.float32).reshape(-1)
obs_t = torch.tensor(rollout.obs[t], dtype=torch.float32, device=device).unsqueeze(0)
with torch.no_grad():
action_t, logp_t, _ent_t, value_t = model.act(obs_t)
if model.is_discrete:
action = int(action_t.item())
else:
action = action_t.squeeze(0).cpu().numpy().astype(np.float32)
next_obs, reward, done, _info = env_step(env, action)
rollout.actions[t] = action
rollout.log_probs[t] = float(logp_t.item())
rollout.values[t] = float(value_t.item())
rollout.rewards[t] = float(reward)
rollout.dones[t] = float(done)
ep_return += reward
global_step += 1
obs = next_obs
if done:
logs['episode_returns'].append(float(ep_return))
ep_return = 0.0
obs = env_reset(env)
# Bootstrap value for the last observation
obs_last = torch.tensor(np.asarray(obs, dtype=np.float32).reshape(-1), device=device).unsqueeze(0)
with torch.no_grad():
next_value = float(model.value(obs_last).item())
adv_np, ret_np = compute_gae(
rewards=rollout.rewards,
dones=rollout.dones,
values=rollout.values,
next_value=next_value,
gamma=gamma,
gae_lambda=gae_lambda,
)
# Flatten batch tensors
b_obs = torch.tensor(rollout.obs, dtype=torch.float32, device=device)
if model.is_discrete:
b_actions = torch.tensor(rollout.actions, dtype=torch.int64, device=device)
else:
b_actions = torch.tensor(rollout.actions, dtype=torch.float32, device=device)
b_old_logp = torch.tensor(rollout.log_probs, dtype=torch.float32, device=device)
b_old_values = torch.tensor(rollout.values, dtype=torch.float32, device=device)
b_adv = torch.tensor(adv_np, dtype=torch.float32, device=device)
b_returns = torch.tensor(ret_np, dtype=torch.float32, device=device)
# Advantage normalization is standard PPO2 practice
b_adv = (b_adv - b_adv.mean()) / (b_adv.std() + 1e-8)
# PPO update: multiple epochs over the same on-policy batch
batch_indices = np.arange(n_steps)
metrics_accum = {
'policy_loss': [],
'value_loss': [],
'entropy': [],
'approx_kl': [],
'clipfrac': [],
}
for epoch in range(n_epochs):
rng.shuffle(batch_indices)
for start in range(0, n_steps, minibatch_size):
mb_idx = batch_indices[start : start + minibatch_size]
out = ppo2_update(
model,
optimizer,
obs=b_obs[mb_idx],
actions=b_actions[mb_idx],
old_log_probs=b_old_logp[mb_idx],
old_values=b_old_values[mb_idx],
advantages=b_adv[mb_idx],
returns=b_returns[mb_idx],
clip_coef=float(clip_now),
vf_clip_coef=vf_clip_coef,
ent_coef=float(ent_coef),
vf_coef=float(vf_coef),
max_grad_norm=float(max_grad_norm),
)
for k in metrics_accum:
metrics_accum[k].append(out[k])
# Optional early stopping if KL explodes (common safety valve)
if target_kl is not None and np.mean(metrics_accum['approx_kl']) > 1.5 * target_kl:
break
# Logging at update granularity
logs['timesteps'].append(float(global_step))
logs['policy_loss'].append(float(np.mean(metrics_accum['policy_loss'])))
logs['value_loss'].append(float(np.mean(metrics_accum['value_loss'])))
logs['entropy'].append(float(np.mean(metrics_accum['entropy'])))
logs['approx_kl'].append(float(np.mean(metrics_accum['approx_kl'])))
logs['clipfrac'].append(float(np.mean(metrics_accum['clipfrac'])))
logs['explained_variance'].append(explained_variance(rollout.values, ret_np))
env.close()
return logs
# Run training (adjust total_timesteps if you're on CPU and want it faster)
logs = train_ppo2(
env_id='CartPole-v1',
total_timesteps=120_000,
n_steps=1024,
n_epochs=10,
minibatch_size=64,
learning_rate=3e-4,
ent_coef=0.0,
vf_clip_coef=0.2, # SB/OpenAI-style value clipping (set -1 to disable)
)
len(logs['episode_returns']), logs['episode_returns'][:5]
(726, [19.0, 24.0, 34.0, 26.0, 23.0])
# Plot episodic rewards (and a rolling mean)
episode_returns = np.asarray(logs['episode_returns'], dtype=np.float32)
episodes = np.arange(1, len(episode_returns) + 1)
window = 25
if len(episode_returns) >= window:
rolling = np.convolve(episode_returns, np.ones(window) / window, mode='valid')
rolling_x = np.arange(window, len(episode_returns) + 1)
else:
rolling = episode_returns
rolling_x = episodes
fig = go.Figure()
fig.add_trace(go.Scatter(x=episodes, y=episode_returns, mode='lines', name='Episode return'))
fig.add_trace(go.Scatter(x=rolling_x, y=rolling, mode='lines', name=f'Rolling mean ({window})', line=dict(width=4)))
fig.update_layout(
title='PPO2 on CartPole-v1: episodic reward over training',
xaxis_title='Episode',
yaxis_title='Episodic return',
height=420,
)
fig.show()
# Plot training diagnostics per update
df = {
'update': np.arange(len(logs['timesteps'])),
'timesteps': np.asarray(logs['timesteps']),
'policy_loss': np.asarray(logs['policy_loss']),
'value_loss': np.asarray(logs['value_loss']),
'entropy': np.asarray(logs['entropy']),
'approx_kl': np.asarray(logs['approx_kl']),
'clipfrac': np.asarray(logs['clipfrac']),
'explained_variance': np.asarray(logs['explained_variance']),
}
fig = make_subplots(
rows=2,
cols=3,
subplot_titles=(
'Policy loss',
'Value loss',
'Entropy',
'Approx KL',
'Clip fraction',
'Explained variance',
),
)
def add_line(row, col, y, name):
fig.add_trace(go.Scatter(x=df['update'], y=y, mode='lines', name=name), row=row, col=col)
add_line(1, 1, df['policy_loss'], 'policy_loss')
add_line(1, 2, df['value_loss'], 'value_loss')
add_line(1, 3, df['entropy'], 'entropy')
add_line(2, 1, df['approx_kl'], 'approx_kl')
add_line(2, 2, df['clipfrac'], 'clipfrac')
add_line(2, 3, df['explained_variance'], 'explained_variance')
fig.update_layout(title='Training diagnostics (per PPO update)', height=560, showlegend=False)
fig.update_xaxes(title_text='Update')
fig.show()
10) Stable-Baselines PPO2 (reference implementation)#
Stable-Baselines (the TensorFlow library, now in maintenance mode) provides a PPO2 class.
Example from the Stable-Baselines docs (CartPole with a vectorized env):
import gym
from stable_baselines.common.policies import MlpPolicy
from stable_baselines.common import make_vec_env
from stable_baselines import PPO2
env = make_vec_env('CartPole-v1', n_envs=4)
model = PPO2(MlpPolicy, env, verbose=1)
model.learn(total_timesteps=25000)
model.save('ppo2_cartpole')
We’ll list and explain the Stable-Baselines PPO2 hyperparameters in the next section.
11) Stable-Baselines PPO2 hyperparameters (explained)#
Stable-Baselines PPO2 (TensorFlow) exposes the following constructor signature (from stable_baselines/ppo2/ppo2.py):
PPO2(
policy,
env,
gamma=0.99,
n_steps=128,
ent_coef=0.01,
learning_rate=2.5e-4,
vf_coef=0.5,
max_grad_norm=0.5,
lam=0.95,
nminibatches=4,
noptepochs=4,
cliprange=0.2,
cliprange_vf=None,
verbose=0,
tensorboard_log=None,
_init_setup_model=True,
policy_kwargs=None,
full_tensorboard_log=False,
seed=None,
n_cpu_tf_sess=None,
)
What each hyperparameter does#
`policy`: policy class (or registered string) like `MlpPolicy`, `CnnPolicy`, `MlpLstmPolicy`.
`env`: Gym env instance or an env id string (e.g. `'CartPole-v1'`).
`gamma`: discount factor \(\gamma\).
`n_steps`: rollout horizon per env per update. With vectorized envs, the batch size is \(n_{\text{batch}} = n_{\text{steps}} \cdot n_{\text{envs}}\).
`ent_coef`: entropy coefficient \(c_e\) (larger → more exploration pressure).
`learning_rate`: learning rate (float) or a schedule function of training progress.
`vf_coef`: value-loss coefficient \(c_v\).
`max_grad_norm`: global gradient norm clip threshold.
`lam`: GAE(\(\lambda\)) parameter.
`nminibatches`: number of minibatches per update (minibatch size is `n_batch / nminibatches`). For recurrent policies, SB recommends `n_envs` be a multiple of `nminibatches`.
`noptepochs`: number of epochs over the on-policy batch per update.
`cliprange`: PPO clip parameter \(\epsilon\) (float) or a schedule.
`cliprange_vf`: value-function clipping range:
`None` (default): reuse `cliprange` for the value function (OpenAI Baselines legacy behavior).
negative value (e.g. `-1`): disable value clipping (closer to the original PPO paper).
positive float/schedule: enable value clipping with that range. Note: value clipping depends on reward scaling.
`verbose`: logging verbosity.
`tensorboard_log`: TensorBoard log directory (or `None`).
`_init_setup_model`: whether to build the TF graph at init.
`policy_kwargs`: extra kwargs forwarded to the policy network constructor.
`full_tensorboard_log`: log additional tensors/histograms (large disk usage).
`seed`: random seed (Python/NumPy/TF). For fully deterministic TF runs, SB notes you should set `n_cpu_tf_sess=1`.
`n_cpu_tf_sess`: number of TensorFlow threads.
Mapping to this notebook#
SB `n_steps` → this notebook’s `n_steps`
SB `noptepochs` → this notebook’s `n_epochs`
SB `nminibatches` → this notebook’s `minibatch_size = n_steps / nminibatches` (single-env case)
SB `cliprange` → this notebook’s `clip_coef`
SB `cliprange_vf` → this notebook’s `vf_clip_coef`
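The batch-size arithmetic behind this mapping is worth spelling out once. A minimal sketch (variable names follow the SB docs; the numbers are just an example):

```python
# With vectorized envs, one PPO2 update collects n_steps transitions from
# each of n_envs environments, then splits them into nminibatches chunks.
n_envs = 4
n_steps = 128
nminibatches = 4

n_batch = n_steps * n_envs                 # 512 samples per update
minibatch_size = n_batch // nminibatches   # 128 samples per gradient step
```

In this notebook we run a single env, so `n_batch == n_steps` and we specify `minibatch_size` directly instead of deriving it from `nminibatches`.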